Broadband noise in gravitational wave (GW) detectors, also known as triggers, can often be a 
deterrent to the efficiency with which astrophysical search pipelines detect sources. It is important to 
understand their instrumental or environmental origin so that they can be eliminated or accounted 
for in the data. Since the number of triggers is large, data mining approaches such as clustering and 
classification are useful tools for this task. Classification of triggers based on a handful of discrete 
properties has been done in the past. A rich information content is available in the waveform or 
'shape' of the triggers that has had a rather restricted exploration so far. This paper presents 
a new way to classify triggers deriving information from both trigger waveforms as well as their 
discrete physical properties using a sequential combination of the Longest Common Sub-Sequence 
(LCSS) and LCSS coupled with Fast Time Series Evaluation (FTSE) for waveform classification and 
the multidimensional hierarchical classification (MHC) analysis for the grouping based on physical 
properties. A generalized k-means algorithm is used with the LCSS (and LCSS+FTSE) for clustering 
the triggers using a validity measure to determine the correct number of clusters in absence of any 
prior knowledge. The results have been demonstrated by simulations and by application to a segment 
of real LIGO data from the sixth science run. 

PACS numbers: 95.85.Sz,04.80.Nn, 07.05.Kf, 02.50.Tt, 02.60.Pn 
I. INTRODUCTION 

Gravitational-wave (GW) detectors, viz. LIGO [1] and 
Virgo [2], have been online in data-recording mode since 
2000. LIGO has concluded its sixth science run (S6) in 
2010. Data have been archived from not only the GW 
channels, but also from several hundreds of auxiliary and 
environmental channels. There are two main broad cate- 
gories in which data analysis has been organized: astro- 
physical searches and detector characterization Q. 
These two tasks are not entirely independent. Detector 
characterization research products prepare the ground 
work for understanding the underlying noise and feed the 
astrophysical searches with information Q that symbol- 
ize which data segments are relevant for GW search. 

Data from the GW detectors have both broadband and 
narrowband noise. The narrowband noise (aka lines) is 
extensively studied and several methods [9, 10] have been 
implemented to efficiently remove them from the data. 
However, reduction of the broadband noise (aka trig- 
gers) is a difficult problem that has not been explored 
to the fullest yet. The sensitivity of the detectors has 
improved steadily over the years [11]. With each step 
towards a more sensitive instrument, many new sources 
of noise have also been unearthed. The occurrence of triggers 
in the data is a function of the operating conditions. 
Thus it is important to track down the sources 
to make the output data as high quality as possible and 
to reduce probability of false alarms. A very large effort 
has been undertaken to analyze the noise transients in 
the GW channels (referred to as the DARM_ERR channel) 
in the LIGO detectors [12]. These efforts include examining 
triggers in time-frequency planes [13], exploring the loudest 
triggers seen in burst pipelines [14], [15], studying low-frequency 
seismic disturbances [16], [17], looking at specific 
types of triggers, e.g. from photodiodes [?], and exploring 
structures present in the trigger population in dimensions 
higher than the usual three-dimensional Cartesian system 
using multidimensional hierarchical classification (MHC) 
methods [14], [18], [19]. These methods are complementary 
and cast light on different aspects of the triggers seen in 
the data. 

GW data are archived in the format of discretely sam- 
pled time series from the main GW channel as well as 
from several hundreds of instrumental and environmental 
channels that are recorded specifically to monitor func- 
tioning of different instrumental subsystems and environ- 
mental activities that affect the GW channel data. The 
triggers arrive at a high rate in all channels. This requires 
data mining methods to keep up with near realtime in- 
formation and to process the enormous volume of data 
for information extraction. Classification is the most ef- 
fective way of addressing this problem. The methods of 
classifying large data sets in multidimensional parameter 
space bring an immediate reduction in the dimensional- 
ity of the problem under the assumption that existing 
classes show some common collective properties. In the 
context of triggers seen in the GW data, we would like to 
explore how many different classes of triggers are present 
and how to characterize these classes in terms of their 
origin. Development of a knowledge base in understand- 
ing the properties of the triggers thus seen in GW data 
contributes towards the development of realistic noise models 
that are essential for proper assessment of the performance 
of astrophysical search pipelines. 

There are several analysis pipelines that operate online 
on LIGO data to look for burst-like signals or triggers, 
e.g. kleine welle (KW) [11], omega (OP) [29] and 
waveburst (WB) [?]. The KW pipeline works on multiple 
channels: the GW channel and several hundreds of 
auxiliary and environmental channels. The threshold is 
kept such that the pipeline picks up triggers of all types 
at a steady rate. Unsupervised data mining methods such as 
the MHC analysis [14], [18] have been developed and applied 
to LIGO science data in the recent past. The aim 
of these studies has been to classify the population of 
triggers seen in the GW, environmental and auxiliary 
channels into statistically significant distinct groups with 
uniform characteristics. These studies have been mostly 
based on a handful of discrete properties of the triggers 
viz. duration, central frequency and signal-to-noise ratio 
(snr). However, an important aspect of the triggers, viz. 
the 'shape' factor, has largely been overlooked. The shape 
of the triggers, or the waveform, often contains rich information. 
Temporal classification methods [22]-[25], e.g. 
S-means and Constraint Validation [21], have been developed 
in the recent past and studied on simulated GW 
data. However, they have not been tested on data from 
real GW detectors. 

Another successful and often applied method is based 
on distances calculated using the Longest Common Subsequence 
(LCSS) [26]. The technique was studied in 
GW data for the first time in a recent publication [19]. 
The preliminary results showed that it produces 
trigger clusters with similar shapes. The current paper 
explores fast, accurate and efficient methods for unsuper- 
vised classification of trigger waveforms further. An anal- 
ysis pipeline has been constructed based on LCSS and 
also LCSS in conjunction with Fast Time Series Evaluation 
(FTSE) [27]. The latter is done to explore possibilities 
of increasing computational speed when very large 
datasets of triggers are involved. This is the first ex- 
ploration of combined LCSS and FTSE methods in the 
context of GW data analysis. A second stage involving 
the MHC methods [14], [18] is carried out to check the homogeneity 
of the clusters in the parameter space of their 
physical properties. This results in further segmentation 
of the triggers so that they can be attributed to their sources, 
instrumental or environmental. Some of the specific 
questions we explore in the paper are (i) Are LCSS and 
LCSS+FTSE suitable methods for fast and accurate trig- 
ger classification? (ii) How are the resulting classes of 
triggers characterized? (iii) What are the computational 
costs involved? (iv) How robust are these methods for 
GW trigger classification? (v) How can the analysis give 
relevant information for tracking down sources of non- 
GW triggers? 

It is shown in this study that application of the LCSS 
(or LCSS+FTSE) followed by MHC is a very useful and 
productive way to classify triggers based on their waveforms 
and characteristic physical properties. As found 
in the study with S6 data, the pipeline produces 
trigger classes with similar waveforms and similar ranges of 
amplitude, central frequency, Q-factor and snr. 
Each of these classes of triggers is shown to be related to 
a group of auxiliary and environmental channels indicat- 
ing the most probable sources of their origin. This com- 
bined classification pipeline performs better than meth- 
ods based either only on discrete trigger properties or on 
trigger waveforms alone. Thus it can be turned into an 
effective, low latency trigger identification tool for GW 
detector characterization. 

This paper is an illustration of the method and its ad- 
vantages. Results from the sample S6 data chosen over a 
two day period are used to show how the method could 
be applied in science runs to extract information comple- 
mentary to and in conjunction with the existing meth- 
ods with low latency. The final outcome of this analysis 
shows existence of several statistically significant classes 
of triggers with distinct waveforms and physical proper- 
ties coming from the GW channel in the test data set 
from S6. Post-classification analysis explores the cou- 
plings of these trigger classes to different sets of auxiliary 
and environmental systems. Application of LCSS and 
LCSS+FTSE (which yields classes based on waveforms) 
alone would give 19 distinct classes, while application of 
only the MHC analysis would have given 3 statistically 
significant classes. Thus, the proposed analysis clearly 
has advantage in using a bigger parameter space leading 
to a finer classification structure that helps in the iden- 
tification of triggers by maximum utilization of its in- 
formation content. Since each of the subgroups contains 
triggers with very characteristic properties related to a 
specific set of channels, the method proves useful in the 
classification of triggers seen in GW data and in helping 
with tracking down the sources or origins of the triggers. 
As a direct application to detector characterization, we 
can classify the triggers seen in GW science data into 
different groups with characteristic properties, related to 
specific instrumental and environmental sources. We can 
thus study the trend of various kinds of triggers and gain 
insight into how some of the channels may be responsible 
for the production of specific types of triggers. 

The paper is organized as follows. Sec. II describes 
the trigger generation process and Sec. III describes the 
FTSE and LCSS algorithms. We explain the pipeline 
for generation of trigger clusters in Sec. IV. Sec. V then 
presents results from numerical simulations and applications 
to S6. Conclusions and directions for future work 
are presented in Sec. VI. 



II. TECHNIQUES FOR DETECTING 
TRIGGERS IN LIGO DATA 

Techniques for detection of burst-like triggers in the 
GW instrumental output data stream are described con- 
cisely in an earlier paper [l5| . In general, such techniques 
project the data onto a basis that spans the parameter 
space of the burst-like signals. A measurement becomes 
optimal when there is an exact match between a member 
in the contracted basis and a burst. The snr $\rho$ in this 
case is defined as 



where $\|h\|^2$ is the total energy content of the signal and 
$S_h(f)$ is the one-sided power spectral density. In cases 
where a close match between the signal and the member 
of the basis is not achieved, the bursts are characterized 
by a quality factor Q defined as 

$$Q = \frac{f_c}{\sigma_f}, \qquad (2)$$

where $f_c$ is the central frequency and $\sigma_f$ is the bandwidth. 
The time-frequency plane is thus tiled by the 
Q values, where the signals are represented by localized 
pixels with the same Q. 

III. ALGORITHMS FOR TRIGGER 
CLASSIFICATION IN A MULTIDIMENSIONAL 
SPACE 

There are two main issues that one must address 
while developing an efficient feature-based classification 
method: (i) the data may have temporal gaps, i.e. the 
same pattern may occur at different time epochs, 
and (ii) the classification should preferably be automatic 
(to keep up with the online feedback systems), and thus 
unsupervised classification methods that are robust to 
noise need to be explored. 

A. Longest Common Sub Sequence 

Let us first take a look at why LCSS algorithm is ef- 
ficient and how it fits into GW data analysis. In any 
unsupervised classification method, the first step is to 
calculate the distance between points in the parameter 
space. Even though Euclidean distance is the most com- 
monly used method, it is not suitable for addressing the 
first difficulty mentioned above: two triggers that 
have the same waveform may show a high Euclidean distance 
if they do not occur simultaneously, and they would 
then be placed in separate clusters that are redundant 
from the physical point of view. The LCSS algorithm is able 
to compute a match between two time series that do not 
necessarily occur at the same time, without 
having to rearrange the sample sequences to coincide. 

As an example, let us consider a sequence of characters 
$x_m, x_n, x_p, x_p, x_q, x_n, x_q$. A subsequence is defined 
as a set of characters that appear in order from 
left to right, but not necessarily consecutively. Thus, 

$[x_m, x_n, x_p]$, $[x_m, x_p, x_p, x_q]$ and $[x_m, x_p, x_p, x_q, x_n]$

are subsequences, but $[x_p, x_p, x_m]$ is not a subsequence, 
since $x_m$ does not occur after $x_p$ in the sequence. 
A common subsequence of two sequences is a subsequence 
that appears in both sequences. A longest common subsequence 
(LCSS) is a common subsequence of maximal 
length. For example, suppose 

si = uuuvvwtwuwttuttvwttvtuvvuu (3) 

and 

S2 = vuvvwtuuwwt\xvvttt~wwttv, (4) 

an LCSS (denoted by LCSS($s_1$, $s_2$)) is given by 
uvvtuwtuvtttw. In principle, an LCSS can be found by 
enumerating all subsequences of $s_1$ and 
checking whether they are subsequences of $s_2$ as well. 

Let us now look at the theory of LCSS in the context 
of the trigger waveforms in the present study. Formally, 
this amounts to comparing two input trigger time series 
sequences $X(1 \ldots m)$ and $Y(1 \ldots n)$, where $m$ and 
$n$ denote the lengths of the sequences X and Y respectively. 
The length of the LCSS of X and Y (written as 
LCSS(X, Y)) will be denoted by $\ell$. The recurrence relation 
[47] leading to the length of the LCSS for each pair 
$[X(1 \ldots i), Y(1 \ldots j)]$ is given as follows: 

$$a(i,j) = 0 \qquad (5)$$

if X or Y is an empty sequence, i.e. if $i = 0$ or $j = 0$; 

$$a(i,j) = 1 + a(i-1, j-1) \qquad (6)$$

if $X(i) = Y(j)$; 

$$a(i,j) = \max\{a(i-1, j),\ a(i, j-1)\} \qquad (7)$$

if $X(i) \neq Y(j)$, where 

$$a(i,j) = \ell[X(1 \ldots i), Y(1 \ldots j)]. \qquad (8)$$

The algorithm is shown graphically in Fig. 1. A single 
$a$ value is localized in the sense that it depends only on 
the three neighboring values. After the table has been 
filled, the length of the subsequence is found in 

$$a(m, n) = \ell[X(1 \ldots m), Y(1 \ldots n)]. \qquad (9)$$

The common subsequence is found by backtracking from 
$a(m, n)$, following at each step the pointers that are 
set during the calculation of the values. When a match 
is found, the LCSS is extended. In this way, one can 
traverse a path through the LCSS table (as shown in Fig. 1) 
until a length of zero is reached. In this case, $\ell(X, Y) = 4$ 
and the LCSS corresponding to the path shown is 
uvuu. 
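
The recurrence of Eqs. (5)-(8) and the backtracking step can be sketched as follows. This is a minimal illustration that assumes exact equality between sequence elements; the function names `lcss_table` and `lcss` are hypothetical and not taken from the paper's pipeline.

```python
def lcss_table(x, y):
    """Fill the (m+1) x (n+1) table a(i, j) of LCSS lengths of prefixes."""
    m, n = len(x), len(y)
    a = [[0] * (n + 1) for _ in range(m + 1)]  # a(i, 0) = a(0, j) = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:           # match: extend the diagonal
                a[i][j] = 1 + a[i - 1][j - 1]
            else:                              # no match: best of the neighbors
                a[i][j] = max(a[i - 1][j], a[i][j - 1])
    return a


def lcss(x, y):
    """Return one longest common subsequence by backtracking from a(m, n)."""
    a = lcss_table(x, y)
    i, j = len(x), len(y)
    out = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif a[i - 1][j] >= a[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return list(reversed(out))
```

The table costs O(m x n) time and memory, which is the cost that the FTSE variant discussed later tries to avoid.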

The algorithm works as follows. A pair $(i, j)$ defines a 
match if 

$$X(i) = Y(j). \qquad (10)$$

The set of all matches is given by 

$$\eta = \{(i,j) \mid X(i) = Y(j),\ 1 \le i \le m,\ 1 \le j \le n\}. \qquad (11)$$
FIG. 1: The figure shows graphically how the LCSS algorithm 
works. The two-dimensional array with the two sequences 
X and Y is shown along the two axes. Initially the 
cells in the array have uniform entries of zeros. In the next 
step, we look for an element-to-element match. Whenever a 
match is found, the cell entry is incremented by one. The sequence 
of arrows shows a possible LCSS path. The LCSS path 
in this case indicates that there is a match of four elements 
in the sequence, viz. uvuu [31], [?]. More details are given in 
[?]. 

Each match belongs to a class 

$$\eta_k = \{(i,j) \mid (i,j) \in \eta,\ a(i,j) = k\}, \quad 1 \le k \le \ell. \qquad (12)$$

A match belonging to $\eta_k$ is called a $k$-match. In 
Fig. 1 the marked entries define the class $\eta_k$. Since 
each match belongs to only one class, these classes partition 
all matches of $\eta$. 

Some $k$-matches are more useful algorithmically (e.g. 
square marks in Fig. 1) than the others (e.g. circle 
marks in Fig. 1). This can be seen as follows. Let 
us consider matches $(i,j)$ and $(i',j')$ of $\eta_k$ with $i = i'$ and 
$j < j'$, or $i < i'$ and $j = j'$. Every element of $\eta_{k+1}$ that 
can follow $(i',j')$ can also follow $(i,j)$ in the LCSS. Thus, it 
is sufficient to consider only the dominant matches $(i,j)$. 
Let $\tilde{\eta}_k$ be the set of all dominant $k$-matches. The regions 
in the figure where $a(i,j)$ values are equal (shown by 
vertical and horizontal lines) are called LCSS contours. 
Each $k$-match lies immediately below the $k$th contour. 
These contours are defined by an ordering property. If 
$\tilde{\eta}_k = \{(i_1,j_1), (i_2,j_2), \ldots, (i_l,j_l)\}$, the matches can be renumbered 
such that $i_1 < i_2 < \cdots < i_l$ and $j_1 > j_2 > \cdots > j_l$. 
The strategy for locating the dominant matches is based 
on advancing from contour to contour. 



B. K-means 

Once the distances are calculated according to equation 
(5), a generalized k-means [29] algorithm is employed 
to form the clusters. The choice of k-means 
is dictated by the fact that it allows formation 
of homogeneous clusters that are insensitive to outliers. 

K-means [13], [?] starts with two inputs: 
the number of clusters $K$ and the set of elements 
$D = \{t_1, t_2, \ldots, t_n\}$. The algorithm works as follows. 

Let $M = \{m_1, m_2, \ldots, m_K\}$ be the set of centroids, assigned 
randomly, so that the size of the set $M$ equals $K$. 
Each item $t_i$ is placed into the cluster which has the 
nearest mean. $M$ is then populated with a new value of the 
mean for each cluster $K_i$. The cluster mean of 
$K_i = \{t_{i1}, t_{i2}, \ldots, t_{il}\}$ is usually (but not necessarily) calculated 
as 

$$m_i = \frac{1}{l} \sum_{j=1}^{l} t_{ij},$$

where $l$ is the number of items and $t_{ij}$ is the $j$th item placed 
in cluster $K_i$. 

The assignment and mean-update steps are repeated until a defined 
convergence condition is met, such as no further change in the 
membership of the clusters. 
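
The steps above can be sketched in plain Python. This is an illustrative version of ordinary k-means on feature vectors with squared Euclidean distance; the "generalized" variant used in the paper is not specified in this section, and the names `kmeans` and `sq_dist` as well as the `seed` and `max_iter` parameters are assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Random initial centroids: k distinct items drawn from the data set.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each item goes to the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # converged: memberships did not change
            break
        labels = new_labels
        # Update step: recompute each cluster mean from its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Because the initial centroids are random, different seeds can give different partitions, which is why the number of clusters is later chosen from repeated runs.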

One of the shortcomings of k-means is that the algorithm 
needs a predefined value of K, the number of clusters, 
which is exactly what the proposed clustering technique 
aims to determine. For previously unseen data, the 
number of significant groups in the classification structure 
is unknown and hence it cannot be supplied to k-means. 

Several validity measures have been developed to determine 
the value of K in k-means. Using the ratio of 
intra-cluster distances $S_{intra}$ to inter-cluster distances 
$S_{inter}$, a simple validity measure $V_{min}$ is used to find 
the optimum number of clusters [30]. The time series 
representations of the triggers within a cluster (each a point in 
the multivariate parameter space) must be as similar as 
possible, and points belonging to different clusters 
must be as different as possible (in the same multivariate 
parameter space), to ensure compactness of clusters. 
Therefore, the intra-cluster distance should be minimum 
and the inter-cluster distance should be maximum. Since 

$$V_{min} = \frac{S_{intra}}{S_{inter}},$$

the value of K which makes the validity measure minimum 
is the ideal one. In practice it is expected that 
the values of K found from the above method would vary 
slightly between runs. This happens because, in the 
k-means algorithm, the centroids are assigned randomly. 
Therefore, the validity measure was computed one hundred 
times on the same sets of data and the most frequent 
value of K was chosen. 
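
As a hedged illustration of the validity measure, the sketch below assumes one common definition: $S_{intra}$ is the mean squared distance of each point to its own centroid and $S_{inter}$ is the smallest squared distance between any two centroids. The section does not spell out these definitions, so both are assumptions, as are the function names.

```python
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def validity(points, labels, centroids):
    """Ratio S_intra / S_inter for one clustering (smaller is better)."""
    # Assumed S_intra: average squared distance of points to their centroid.
    intra = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    intra /= len(points)
    # Assumed S_inter: minimum squared distance between distinct centroids.
    k = len(centroids)
    inter = min(
        sq_dist(centroids[a], centroids[b])
        for a in range(k)
        for b in range(a + 1, k)
    )
    return intra / inter


def most_frequent_k(best_k_per_run):
    """Pick the K that minimized the validity measure most often over runs."""
    return Counter(best_k_per_run).most_common(1)[0][0]
```

In use, one would evaluate `validity` for each candidate K in every run, record the minimizing K per run, and pass the list of per-run winners to `most_frequent_k`.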

The role of LCSS in this clustering scheme is to gen- 
erate an adjacency matrix containing information about 
the pairwise distances between trigger time series. Once 
the matrix is created, it can be supplied to k-means, 
which treats it as an ordinary input and operates on it 
as described above. 
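
A minimal sketch of how such a pairwise matrix could be built is given below. It assumes that two real-valued samples match when they differ by at most a threshold `eps`, and that the LCSS length is converted to a distance as one minus the length normalized by the shorter series. Neither the threshold nor the normalization is stated in this section, so both are illustrative choices.

```python
def lcss_length(x, y, eps=0.0):
    """LCSS length of two numeric series; samples match if |x_i - y_j| <= eps."""
    prev = [0] * (len(y) + 1)          # row a(i-1, .) of the LCSS table
    for xi in x:
        cur = [0]                      # a(i, 0) = 0
        for j, yj in enumerate(y, 1):
            if abs(xi - yj) <= eps:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcss_distance_matrix(series, eps):
    """Symmetric matrix of assumed distances 1 - LCSS / min(length)."""
    n = len(series)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shortest = min(len(series[i]), len(series[j]))
            sim = lcss_length(series[i], series[j], eps) / shortest
            d[i][j] = d[j][i] = 1.0 - sim
    return d
```

With this choice, a waveform and a time-shifted copy of it have distance zero, which is the property that motivates LCSS over the Euclidean distance in the text.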



C. Attempt to Improve Computational Speed: 
Fast Time Series Evaluation 



If the classification algorithm is to be implemented as 
a near real-time tool for trigger identification, the 
process must be computationally fast and efficient. We 
therefore investigated the Fast Time Series Evaluation 
(FTSE) algorithm [27] in an attempt to make LCSS 
faster. LCSS in conjunction with FTSE is claimed to be 
faster than the LCSS algorithm alone because, unlike the LCSS 
calculation, FTSE does not use a two-dimensional array 
to compare matches between two time series. Values in 
one time series are entered into a grid. Then a point in 
the other time series probes into its respective grid cell 
(of the same grid) to check if points of the first time series 
reside there. The construction of the grid ensures that 
if points are found residing in that grid cell, they must 
match the probing point within a defined threshold. To 
put it simply, in LCSS computed with FTSE, the com- 
parison between two time series occurs only between the 
intersecting portions and that is the underlying reason 
why LCSS computed with FTSE is expected to be faster 
than LCSS - where the comparison occurs throughout 
the entire length of both time series. 

The average cost of calculating LCSS is $O(p \times q)$, where 
$p$ and $q$ are the lengths of the two sequences being compared. 
The average cost of FTSE computing LCSS is $O(M' + Lq)$, 
where $M'$ is the number of matches, $L$ is the longest intersection 
between the two sequences and $q$ is the length 
of the probing sequence [3]. The process of computing 
LCSS with FTSE is illustrated in Fig. 2. The 
top row represents the cells of the grid. According to 
the figure, matching elements of a series X are put into 
the same cells of the grid. The elements of the second 
series Y are then compared with the grid cells formed from X to 
check if points of the first time series reside there. In this case, the element 
B of Y is found to match the second grid cell and thus the 
length of the matching subsequence is augmented by one. 
The other elements are probed in the same 
manner and the total matching length between the sequences 
is recorded. Once all the matching lengths of 
the trigger signals are calculated, the signals are clustered 
using the method described above. 
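
The grid-probing idea can be illustrated with the sketch below. It is not the published FTSE algorithm: it is a simplified variant in the same spirit, which bins one series into a grid of cell width `eps`, probes that grid with the other series to obtain match lists, and then updates a one-dimensional threshold array (in the manner of the Hunt-Szymanski method) instead of filling a two-dimensional table. The function name and the use of a dictionary as the grid are assumptions.

```python
import bisect
import math
from collections import defaultdict


def lcss_length_grid(x, y, eps):
    """LCSS length using a 1-D grid and match lists; requires eps > 0."""
    # Grid: cell number -> indices of x whose values fall in that cell.
    grid = defaultdict(list)
    for i, xi in enumerate(x):
        grid[math.floor(xi / eps)].append(i)

    # thresh[k] = smallest index in x that ends a common subsequence
    # of length k + 1 found so far (kept strictly increasing).
    thresh = []
    for yj in y:
        cell = math.floor(yj / eps)
        # Values within eps of yj can only sit in the same or adjacent cells.
        hits = [
            i
            for c in (cell - 1, cell, cell + 1)
            for i in grid.get(c, ())
            if abs(x[i] - yj) <= eps
        ]
        # Descending order prevents one y sample from being used twice.
        for i in sorted(hits, reverse=True):
            k = bisect.bisect_left(thresh, i)
            if k == len(thresh):
                thresh.append(i)
            else:
                thresh[k] = i
    return len(thresh)
```

The work scales with the number of matches rather than with the full product of the two lengths, which is the reason a grid-based approach is expected to be faster when matches are sparse.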
FIG. 2: The figure shows graphically how the LCSS+FTSE 
algorithm works. The top row represents the cells of the grid. 
According to the figure, indices of the matching elements of 
a series X are put into corresponding cells of the grid. The 
elements of the second series Y are compared with the grid 
cells formed from X to check if points of the first time series 
reside there. The construction of the grid ensures that if 
points are found residing in that grid cell, they must match 
the probing point within a defined threshold. In this case, 
the element B of Y is found to match the second grid cell 
and thus the value inside the cell enters the intersection list. 
Likewise, the other elements of Y are also probed in a similar 
manner and the total matching points between the sequences 
are recorded in the intersection list [27]. In this example, 
the contributing indices 2, 3, 4, 6 give the subsequence, viz., 



D. Multidimensional Hierarchical Classification 
algorithm 

The MHC pipeline implements a hierarchical algorithm 
[14-19] that applies a variance minimization criterion and 
groups together burst triggers detected by the KW [11] 
pipeline based on their similarity in the higher dimensional 
space spanned by properties like the trigger duration, 
frequency, snr and statistical significance. 

The algorithm starts with calculation of a Euclidean 
distance between vectors $X_m$ and $X_n$ defined by 

$$\Delta_{mn}^2 = (X_m - X_n)(X_m - X_n)^T. \qquad (14)$$

The data matrix has dimension $p \times q$, where each of the $p$ 
triggers is described by a $(1 \times q)$ vector $X_1, X_2, \ldots, X_p$. 
Calculation of the distance is followed by computation of a 
suitable measure of proximity between two groups of objects. 
These are called 'linkage' criteria. We adopt the 
criterion of 'complete' linkage, which measures the largest 
distance between objects in the two clusters. If $N_m$ is the 
number of objects in class $m$, $N_n$ is the number of 
objects in class $n$, and $X_{mj}$ is the $j$th object in class $m$, 
a complete linkage is defined as 

$$d(m, n) = \max(\Delta(X_{mj}, X_{nk})), \qquad (15)$$

with $j$ ranging over $1, \ldots, N_m$ and $k$ ranging over 
$1, \ldots, N_n$. $\Delta$ denotes the distance. This stage of 
the algorithm results in a hierarchy from $p$ clusters with 
one object each to one cluster with $p$ objects. The choice of 
significant clusters is given by computation of the correlation 
coefficient $r$ [40]. $r^2$ is related to the fraction of the 
total variance accounted for by partitioning into $s$ clusters. 
$r^2$ is defined as follows. If 



W, 



X,-X 



(16) 



(17) 



total | 



$N_i$ denotes the number of members in the $i$th class, $\bar{X}_i$ 
denotes the unweighted mean of the population in the 
$i$th class and $\bar{X}_{total}$ denotes the mean of the entire population 
[14], [18]. 
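
The complete-linkage stage described above can be sketched as a simple agglomerative loop. This is an illustrative, unoptimized version that stops at a requested number of clusters; it does not implement the variance-based selection of significant clusters, and the function name and the `n_clusters` stopping rule are assumptions.

```python
import math


def complete_linkage(points, n_clusters):
    """Merge clusters until n_clusters remain; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # start: p clusters of one object

    def linkage(a, b):
        # Complete linkage: largest Euclidean distance between members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest complete-linkage distance.
        _, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Running the loop down to a single cluster reproduces the full hierarchy from $p$ singletons to one cluster with $p$ objects.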

The statistical significance of the classification scheme 
can be verified by using the method of Multivariate Analysis 
of Variance (MANOVA) [39]. Assuming that the underlying 
distribution is a multinomial mixture [41], [42], 
the model is given below. For an $n$-dimensional data set 
with $m$ clusters, each with $p_k$ members, the $i$th trigger in 
the $j$th cluster gives an $n$-dimensional vector 

$$X_{ij} = \mu + \tau_j + \epsilon_{ij}, \qquad (18)$$

where $\mu$ is the mean of the total population, $\tau_j$ is the 
offset of the $j$th cluster mean from $\mu$ and $\epsilon_{ij}$ is the scatter 
of the points around the mean value. The hypothesis to 
be tested is 

$$H_0 : \tau_1 = \tau_2 = \tau_3 = \cdots = \tau_m = 0. \qquad (19)$$

Let the sum of squares be written as 

$$SS = \sum_{k=1}^{m} p_k (\bar{X}_k - \bar{X})(\bar{X}_k - \bar{X})^T \qquad (20)$$

and the cross products be written as 

$$CP = \sum_{k=1}^{m} \sum_{j=1}^{p_k} (X_{kj} - \bar{X}_k)(X_{kj} - \bar{X}_k)^T. \qquad (21)$$

The test statistics are as follows: 

(i) Wilks' lambda [?]: 

$$\Lambda^* = \frac{\det(CP)}{\det(SS + CP)}; \qquad (22)$$

(ii) Pillai's trace [?]: 

$$V = \mathrm{trace}[SS \times (SS + CP)^{-1}]; \qquad (23)$$

(iii) Hotelling-Lawley's trace (or Mahalanobis $D^2$ statistic) [?]: 

$$U = \mathrm{trace}(CP^{-1} \times SS). \qquad (24)$$

All the test statistics follow the non-central F distribution 
[?]. 
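
As an illustration of the first test statistic, the sketch below computes Wilks' lambda from the between-cluster sum of squares SS and the within-cluster cross products CP. It assumes that each cluster's contribution to SS is weighted by its number of members, which is the usual MANOVA convention, and it uses small pure-Python matrix helpers; all function names are hypothetical.

```python
def outer(u, v):
    """Outer product of two vectors as a list-of-lists matrix."""
    return [[a * b for b in v] for a in u]


def add(A, B):
    """Element-wise sum of two matrices of equal shape."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def mean(points):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def det(A):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [list(row) for row in A]
    n, d = len(A), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[p][c] == 0.0:
            return 0.0
        if p != c:
            A[c], A[p] = A[p], A[c]
            d = -d                      # a row swap flips the sign
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return d


def wilks_lambda(groups):
    """det(CP) / det(SS + CP) for a list of clusters of vectors."""
    dim = len(groups[0][0])
    grand = mean([x for g in groups for x in g])
    SS = [[0.0] * dim for _ in range(dim)]
    CP = [[0.0] * dim for _ in range(dim)]
    for g in groups:
        gm = mean(g)
        dk = [a - b for a, b in zip(gm, grand)]
        # Between-cluster term, weighted by cluster size (assumed convention).
        SS = add(SS, [[len(g) * v for v in row] for row in outer(dk, dk)])
        for x in g:
            e = [a - b for a, b in zip(x, gm)]
            CP = add(CP, outer(e, e))   # within-cluster scatter
    return det(CP) / det(add(SS, CP))
```

Values near 1 indicate that the cluster means are indistinguishable, while values near 0 indicate well-separated clusters.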

IV. ANALYSIS PIPELINE 

The analysis is carried out in three main stages - (a) 
using simulated triggers without additive noise; (b) us- 
ing simulated triggers with additive Gaussian white noise 
and (c) using a segment of LIGO S6 data chosen over a 
two day period. 
FIG. 3: Five different types of simulated trigger waveforms 
were used in the classification pipeline with LCSS and 
LCSS+FTSE. These waveforms are (from top to bottom) 
mixture sinusoids, pulse trains, noise generated with a low-order 
ARMA model [28], a simple triangular sawtooth wave 
and a chirp. The simulated set consisted of 760 waveforms 
with these shapes but varying in amplitude, frequency, relative 
location on the time axis, etc. The figure represents 
typical examples of each type. No noise was added to these 
trigger waveforms. The x-axis denotes time samples and the 
y-axis denotes amplitude. 



A. Simulated triggers without additive noise 



We first generate a data set of 760 simulated waveforms 
with variable parameters, each 1024 samples long. The 
waveforms are in the shapes of mixture sinusoids, pulse 
trains, noise generated with a low-order Auto Regressive 
Moving Average (ARMA) model [28], triangular sawtooth 
waves and chirps with varying parameters, i.e. varying 
amplitude, frequency, width, location on the time axis, 
etc. These waveform models are selected to generate 
waveforms of diverse nature and shapes. ARMA-model based 
waveforms are included in the simulation because ARMA is 
a general scheme that can 
model many different types of waveforms (up to the second 
moment), including the type of outputs we see in our GW 
detectors. The amplitude is normalized. The waveform 
database is introduced to the pipeline as the prime input. 
The waveforms are shown schematically in Fig. 3. The 
number of clusters is determined by k-means. 
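
A rough sketch of how one example of each of the five waveform families could be generated is shown below. The parameter values, the ARMA(1,1) order and coefficients, and all function names are illustrative assumptions; the paper states only the waveform types, the length of 1024 samples and that amplitudes are normalized.

```python
import math
import random

N = 1024  # samples per waveform, as stated in the text


def normalize(w):
    """Scale a waveform so that its peak absolute amplitude is 1."""
    peak = max(abs(v) for v in w)
    return [v / peak for v in w] if peak > 0 else list(w)


def mixture_sinusoid(f1, f2):
    """Sum of two sinusoids with f1 and f2 cycles over the record."""
    return normalize([
        math.sin(2 * math.pi * f1 * t / N) + 0.5 * math.sin(2 * math.pi * f2 * t / N)
        for t in range(N)
    ])


def pulse_train(period, width):
    """Rectangular pulses of `width` samples repeating every `period` samples."""
    return normalize([1.0 if t % period < width else 0.0 for t in range(N)])


def arma_noise(rng, phi=0.6, theta=0.3):
    """ARMA(1,1) noise: x_t = phi * x_{t-1} + e_t + theta * e_{t-1}."""
    w, prev_x, prev_e = [], 0.0, 0.0
    for _ in range(N):
        e = rng.gauss(0.0, 1.0)
        x = phi * prev_x + e + theta * prev_e
        w.append(x)
        prev_x, prev_e = x, e
    return normalize(w)


def sawtooth(period):
    """Sawtooth ramp from -1 towards +1 repeating every `period` samples."""
    return normalize([2.0 * (t % period) / period - 1.0 for t in range(N)])


def chirp(f0, f1):
    """Linear chirp sweeping from f0 to f1 cycles per record."""
    out = []
    for t in range(N):
        tau = t / N
        out.append(math.sin(2 * math.pi * (f0 * tau + 0.5 * (f1 - f0) * tau * tau)))
    return normalize(out)


def simulate(rng):
    """One example of each waveform type with arbitrary parameters."""
    return {
        "mixture": mixture_sinusoid(rng.randint(2, 10), rng.randint(11, 30)),
        "pulse": pulse_train(rng.randint(64, 128), rng.randint(4, 16)),
        "arma": arma_noise(rng),
        "sawtooth": sawtooth(rng.randint(64, 256)),
        "chirp": chirp(2.0, rng.uniform(20.0, 60.0)),
    }
```

Repeated calls to `simulate` with different random states would build up a database comparable in spirit to the 760-waveform set.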
FIG. 4: This figure represents simulated trigger waveforms 
with added Gaussian white noise used in the classification 
pipeline with LCSS and LCSS+FTSE. These waveforms are 
(from top to bottom) mixture sinusoids, pulse trains, noise 
generated with a low-order ARMA model, a simple triangular 
sawtooth wave and a chirp. The figure represents typical 
examples of each type. The x-axis denotes time samples and 
the y-axis denotes amplitude. The standard deviation of the 
noise added to the trigger waveforms shown in the current 
figure is $\sigma = 0.25$. 



B. Simulated triggers with additive Gaussian noise 

In the next step of this study, we generate a data set of 
760 simulated waveforms of five different types of shape 
with variable parameters, each 1024 samples long, as described 
in the previous section. The amplitude is normalized. 
Each waveform data stream is mixed with Gaussian 
white noise. The output is thus a noisy waveform, 
as shown in Fig. 4. The snr of the triggers is kept 
in the range of 2 to 20. This is fed to the classification 
analysis pipeline. As before, the number of clusters is 
determined by generalized k-means. 



C. LIGO sixth science run trigger database 

Having gained insight with the simulations, we now ap- 
ply the analysis pipeline to classify triggers seen in LIGO 
S6 data. We have used triggers found in the Omega trigger 
catalog [?], [?], [49] during S6. Three hundred and 
forty Omega triggers from the GW channel seen in the 
test data set (with snr > 12) are selected and subjected 
to the analysis pipeline. The aim of this exercise is to see 
if this scheme of classification can find statistically significant 
groups of triggers in this test population based on 
the shape of the waveforms of the triggers. Triggers in 
the Omega catalog are also described in terms of four discrete 
properties. These are central frequency, amplitude, 
snr and the quality factor (or Q-factor). 

Once the classification of triggers take place, we ad- 
dress the important detector characterization question: 
(i) How to characterize these triggers and (ii) What are 
the possible sources of these triggers, i.e. how do they re- 
late to the auxiliary and environmental channel triggers? 
Class characterization and trigger identification steps are 
as follows. 

(i) We take each trigger from a given sub-class that results from the
main classification structure of the GW triggers and take an Omega
scan [48] around the peak time of the GW trigger. Omega scans produce
time-frequency plots of all auxiliary and environmental channel data
that coincide with the trigger peak time.

(ii) A histogram is constructed to see the distribution of the
occurrence of triggers in the auxiliary and environmental channels
over all the time windows corresponding to triggers in a particular
sub-class. The channels with the highest frequency of occurrence are
recorded.

(iii) The existing data quality flags in the LSC detector 
characterization literature Q are also noted for compari- 
son. This also shows if this method points to newer aux- 
iliary and environmental sources other than the existing 
data quality flags. 
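
Step (ii) above amounts to counting, for one sub-class, how often each auxiliary or environmental channel shows a coincident trigger. A minimal sketch is given below; the function name and the input layout (one list of channel names per GW trigger) are assumptions made for illustration, and the example lists are hypothetical.

```python
from collections import Counter


def channel_histogram(scan_windows):
    """Count, per channel, the number of scan windows in which it triggered.

    scan_windows: iterable of iterables of channel names, one entry per GW
    trigger in the sub-class (the channels showing a coincident trigger).
    """
    counts = Counter()
    for channels in scan_windows:
        counts.update(set(channels))  # count a channel at most once per window
    return counts


# Hypothetical coincidence lists for three GW triggers of one sub-class.
windows = [
    ["PEM-LVEA", "ASC-WFS", "PEM-PSL"],
    ["ASC-WFS", "PEM-LVEA"],
    ["ASC-WFS"],
]
hist = channel_histogram(windows)
top_channels = hist.most_common(2)
```

Here `top_channels` holds the most frequently coincident channels, which is the quantity recorded in step (ii).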

Data Conditioning: The time series segments corresponding to the
triggers are first subjected to conditioning. The following sequence
of operations is executed.

(i) Selection of triggers with snr greater than 12 from the KW
database;

(ii) Extraction of raw GW channel data, centered around the trigger
(the extracted time series is denoted by, say, $q_i$);

(iii) Whitening $q_i$ [32] and dynamically removing 0, [l(| the
narrowband noise present in $q_i$. The resulting time series is
denoted by ${}^{c}q_i$;

(iv) Filtering ${}^{c}q_i$ with a bandwidth of $\delta f$ around
$f_c$, where $f_c$ is the central frequency of the trigger as noted
in the KW catalog. The resulting time series is denoted by
${}^{f_c}q_i$;

(v) Re-sampling ${}^{f_c}q_i$ to represent the appropriate bandwidth.
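
Steps (i) and (ii) of the list above can be sketched as follows. This is an illustration under stated assumptions: the dictionary keys `peak_time` and `snr`, the function names, and the toy data stream are all hypothetical, and the whitening, narrowband-noise removal, band-pass filtering and re-sampling of steps (iii) to (v) are not shown.

```python
def select_triggers(triggers, snr_threshold=12.0):
    """Keep the triggers whose snr exceeds the threshold."""
    return [t for t in triggers if t["snr"] > snr_threshold]


def extract_segment(data, sample_rate, data_start, peak_time, half_width):
    """Return the samples within +/- half_width seconds of peak_time.

    data holds samples starting at time data_start (seconds), sampled at
    sample_rate Hz.
    """
    centre = int(round((peak_time - data_start) * sample_rate))
    half = int(round(half_width * sample_rate))
    lo, hi = centre - half, centre + half
    if lo < 0 or hi > len(data):
        raise ValueError("segment extends beyond the available data")
    return data[lo:hi]


# Hypothetical catalogue entries and a toy data stream sampled at 16 Hz.
catalogue = [
    {"peak_time": 2.0, "snr": 15.3},
    {"peak_time": 3.5, "snr": 9.1},
]
stream = list(range(80))  # 5 s of placeholder samples
segments = [
    extract_segment(stream, 16, 0.0, t["peak_time"], 0.5)
    for t in select_triggers(catalogue)
]
```

Raising an error for windows that run past the data boundary avoids silently producing segments of unequal length, which would complicate the later waveform comparison.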

As shown in figure 5, the trigger database is used to first select
triggers with snr > 12. The time stamps on the triggers are used to
extract the corresponding time domain raw data. The raw data are
carefully conditioned to reduce the noise that is mixed with the
triggers, i.e. they are whitened and then narrowband noise is
subtracted from the data. The conditioned time series are resampled
and appropriately bandpassed to record the waveform. Figure 6 shows
the different types of waveforms that have been found in the test
data. These are fed first into the LCSS pipeline. The end product of
the analysis is a set of individual uniform groups of triggers with
similar shape parameters. The number of clusters is determined by
generalized k-means. The individual clusters thus generated are
further subjected to the MHC pipeline for finer classification
structures that can be related to their most probable instrumental or
environmental origins.

FIG. 5: The figure shows the full analysis pipeline as applied to a
sample of S6 triggers in this study. The trigger database is used to
first select triggers with snr > 12. The time stamps on the triggers
are used to extract the corresponding time domain information or
waveforms. The time domain data are carefully conditioned to reduce
the noise that is mixed with the triggers, i.e. they are first
whitened and then narrowband noise is subtracted from the data. The
conditioned time series are resampled and appropriately bandpassed to
record the waveform. These are fed first into the LCSS pipeline and
then independently into the LCSS and LCSS-FTSE combined pipeline. The
individual clusters thus generated are further subjected to the MHC
pipeline for finer classification structures and post-processing.

FIG. 6: The figure represents samples of the various types of
waveforms that were seen in the test data from LIGO's sixth science
run. These trigger waveforms were obtained after application of the
data conditioning part of the pipeline to the raw data. The x-axis
represents time in arbitrary units and the y-axis represents
amplitude, also in arbitrary units.



V. ANALYSIS RESULTS 

The results of the analysis are demonstrated in this section.
Figure 3 shows the different simulated trigger waveforms that are
subjected to the analysis without any additive noise. Figure 7 shows
the results of the classification structure. K-means (as described in
section III) has been run on the database 1000 times and the number
of classes deemed most significant (by the validity measure
$V_{\min}$) is recorded. The histogram shows a peak at 5 significant
classes, which was the true number of clusters in the data. Both LCSS
and LCSS+FTSE showed identical classification structure. Figure 8
shows the computational speeds for LCSS and LCSS+FTSE. Contrary to
expectation, LCSS+FTSE is found to be much more computationally
intensive as the sample size grows larger than 120. The reason for
this divergence is explained in section VI.

FIG. 7: The figure represents results of the classification
structure. K-means is run on the simulated waveform database 1000
times and the number of classes deemed most significant (by the
validity measure) is recorded. The histogram shows that in most of
the cases the data show the existence of five distinct classes, which
is the true number of classes. The x-axis denotes the number of
significant classes. The y-axis shows the corresponding number of
trials.
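
The repeated-clustering experiment described above can be sketched on toy data as follows. Several assumptions are made here and should not be read as the paper's method: the points are scalars rather than waveforms, plain Lloyd k-means with a Euclidean distance replaces the LCSS-based generalized k-means, and the validity score (mean intra-cluster scatter divided by the smallest separation between centres) is an assumed stand-in for the validity measure defined in section III.

```python
import random
from collections import Counter


def kmeans_1d(points, k, rng, iters=50):
    """Plain Lloyd k-means on scalars; returns (centres, labels)."""
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(p - centres[c]))
                  for p in points]
        new_centres = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # An empty cluster keeps its previous centre.
            new_centres.append(sum(members) / len(members) if members
                               else centres[c])
        if new_centres == centres:
            break
        centres = new_centres
    return centres, labels


def validity(points, centres, labels):
    """Assumed validity score (smaller is better): intra / inter scatter."""
    intra = sum((p - centres[l]) ** 2
                for p, l in zip(points, labels)) / len(points)
    inter = min((a - b) ** 2
                for i, a in enumerate(centres) for b in centres[i + 1:])
    return intra / inter if inter > 0 else float("inf")


def best_k(points, k_values, rng):
    """Number of clusters with the smallest validity score in one trial."""
    scores = {}
    for k in k_values:
        centres, labels = kmeans_1d(points, k, rng)
        scores[k] = validity(points, centres, labels)
    return min(scores, key=scores.get)


def cluster_count_histogram(points, k_values, trials, seed=0):
    """Histogram of the preferred number of clusters over repeated trials."""
    rng = random.Random(seed)
    return Counter(best_k(points, k_values, rng) for _ in range(trials))


# Toy data: five well-separated groups of ten scalar points each.
data_rng = random.Random(1)
points = [c + data_rng.uniform(-0.5, 0.5)
          for c in (0.0, 10.0, 20.0, 30.0, 40.0) for _ in range(10)]
hist = cluster_count_histogram(points, range(2, 8), trials=20)
```

Because k-means depends on its random initialization, a single run can settle on a poor partition; histogramming the preferred number of clusters over many trials is what makes the peak a usable estimate.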

The next set of studies is conducted with the same set of waveforms,
but now mixed with Gaussian white noise with varying standard
deviation $\sigma$. The value of $\sigma$ varies from 0.1 to 0.8 in
steps of 0.1. Figure 4 shows the typical trigger waveforms of various
types with added noise. Classification is carried out by k-means as
in the previous case. The best value of the number of clusters is
determined by the validity measure $V_{\min}$. It is found that the
classification structure starts to deteriorate with increasing
$\sigma$, i.e. decreasing signal to noise ratio (snr). For
$\sigma = 0.3$, the histogram peaks at five clusters, but the overall
shape of the histogram is broad, indicating that three or six
clusters are comparably probable (figure 9). For cases with high
noise ($\sigma > 0.3$), an overwhelming majority of trials yield
three clusters. The reason for the deterioration is easily explained.
With increasing noise, many of the noise dominated waveforms, e.g.
the mixture sinusoidal waveforms, the pulse train waveforms and the
ARMA based waveforms, look similar and are thus classified as one
group.

FIG. 8: The figure presents a comparison of computational speeds of
the analysis pipeline using LCSS and combined FTSE with LCSS. While
the two methods seem to have comparable computational speeds (i.e.
not significantly different) for n < 120, the LCSS+FTSE algorithm is
found to be more computationally intensive beyond that point. The
x-axis represents the number of triggers classified and the y-axis
represents computation time in hours.

As stated earlier in this section, both LCSS and LCSS+FTSE showed
identical classification structure, but LCSS+FTSE is found to be much
more computationally intensive. We therefore found no real advantage
at this stage in applying LCSS+FTSE to further analysis, and we
continue the application to the S6 sample data using only the LCSS
algorithm.

The analysis yielded 19 significant clusters of triggers. As
mentioned above, figure 6 shows a typical waveform from each class.
The shape of these waveforms forms the basis of classification into
distinct clusters by the LCSS pipeline. Figure 10 shows how the four
discrete properties of the Omega triggers (snr, amplitude, frequency
and Q-value) vary between the different clusters. While the amplitude
shows the least variation, the other three properties have wide error
bars, indicating that the classes based on similar waveforms are
heterogeneous. We further investigated whether the clusters thus
produced have any significant sub-clusters present in them.

Figure 11 shows details of the properties of one of the 19 classes of
triggers found in the test data (class #10). The top panel in this
figure shows a typical example of the trigger waveforms classified as
belonging to this group. The panels below the top panel show how some
of the chief attributes, viz. central frequency, the snr, the Q-value
and the amplitude of the triggers belonging to this class, are
distributed. A very similar picture arises for another trigger class
in the study (class #14, figure 12).

As it appears from the distributions of properties like amplitude,
central frequency, Q-value and snr for each group, the groups are
fairly diverse within themselves, even though they seem to indicate
similar types of waveforms. This prompts a further look into the
groups themselves. Here, we have implemented the MHC method,
previously developed and described in detail in [ll| [HI. The
classification is based on the four dimensional space spanned by
amplitude, central frequency, Q-value and snr of the triggers.
Statistical significance of the classification structure thus found
has been validated at the p < 10^{-6} level.

This sequential application of the LCSS and the MHC pipelines has
proven very useful as far as distinguishing different types of
triggers from various sources is concerned. The LCSS pipeline
separates out the triggers with similarity of waveform, thus taking a
first cut at fragmenting the huge parameter space into a finite
number of partially uniform (in terms of waveform) groups. The MHC
then further explores the finer physical property based groups
present in these broader classes.

Thus, the final outcome of the analysis on the example data set in
this study shows the existence of 19 statistically significant
classes of triggers with distinct waveforms. Further classification
based on physical properties, coming from the GW channel and
different sets of auxiliary and environmental systems, uncovers more
groups or clusters of triggers. The combination of the two methods
yields trigger clusters that would not appear with either of the
component algorithms alone. For example, application of LCSS (which
yields classes based on waveforms) alone would give 19 distinct
classes, while application of only the MHC analysis would have given
at most 3 statistically significant classes (being constrained by the
dimensionality of the parameter space). Since each of the subgroups
contains triggers with very characteristic properties and can be
related to a specific set of channels, the method proves useful in
the classification of triggers seen in GW data and in helping to
track down the sources or origins of the triggers.

As a direct application to detector characterization, we can classify
the triggers seen in GW science data into different groups with
characteristic properties, related to specific groups of channels. We
can thus study the pattern and trend of various kinds of triggers and
gain insight into how some of the channels may be responsible for the
production of specific types of triggers.

FIG. 9: The figure represents results of the classification structure
for the case of simulated triggers when the noise level is high (i.e.
triggers with low snr). K-means was run on the database 500 times and
the number of classes deemed most significant (by the validity
measure) was recorded. For $\sigma = 0.3$, the histogram peaks at
five clusters, but the overall shape of the histogram is broad,
indicating that three or six clusters are comparably probable. The
results show an overwhelming majority of three clusters for cases
with high noise ($\sigma > 0.3$). More details are given in
section V. The x-axis denotes the number of significant classes. The
y-axis shows the corresponding number of trials.

FIG. 10: The figure shows the variation of snr, amplitude, Q-value
and frequency that describe the triggers in the study. While the
amplitude shows the least variation, the other three properties have
wide error bars, indicating that the classes based on similar
waveforms are heterogeneous. The x-axis represents time in arbitrary
units. The y-axis represents the respective units of each property
displayed.

FIG. 11: The top panel in this figure shows a typical example of the
trigger waveforms classified as belonging to class #10. The panels
below the top panel show how some of the chief attributes, viz.
central frequency, the snr, the Q-value and the amplitude of the
triggers belonging to this class, are distributed. The x-axis
represents each of the properties expressed in arbitrary units. The
y-axis represents numbers.

FIG. 12: The top panel in this figure shows a typical example of the
trigger waveforms classified as belonging to class #14. The panels
below the top panel show how some of the chief attributes, viz.
central frequency, the snr, the Q-value and the amplitude of the
triggers belonging to this class, are distributed. The x-axis
represents each of the properties expressed in arbitrary units. The
y-axis represents numbers.



A. Post classification analysis

In the following examples, we will use the trigger classes 10 and 14
and the sub-groups of triggers found therein to illustrate the post
classification analysis.



1. Auxiliary and Environmental channel connection



Once the clusters of triggers are determined and the class members
are assigned, the next question we ask is: what are the possible
couplings of these trigger clusters to the different sub-systems of
the detector, i.e. what are the possible sources of these triggers?
We tackle this problem by using the Omega scans [151 ]. This
information helps relate the cluster members to triggers seen in the
auxiliary and environmental channels.

FIG. 13: The figure shows how a trigger in the GW channel and
corresponding triggers in auxiliary and environmental channels are
seen in an Omega scan. The top left panel shows a trigger in a GW
channel between 16 and 32 Hz. Simultaneously, triggers are also seen
in the end test mass and intermediate test masses in the X and Y arms
of the interferometer (as seen in the other three panels). The
triggers seen in the auxiliary channels range in frequency between 8
and 32 Hz. The Omega scans can be done on all available auxiliary and
environmental channels that were taking data at the time when the GW
trigger happened.

Omega scans are a set of time-frequency plots that are based on
logarithmic tiling of the time-frequency plane and that detect
burst-like signals in auxiliary and environmental channel data from
the GW detectors. The first part is an application of the dyadic
wavelet transform and the second is a somewhat modified windowed
Fourier transform that tiles the time-frequency plane for a specific
Q-value.

Omega scans corresponding to triggers in a given class are generated
and the corresponding auxiliary and environmental channels that
showed triggers are noted. Figure 13 shows an example of how a
trigger in the GW channel and corresponding triggers in auxiliary and
environmental channels are seen in an Omega scan.

A cumulative list of auxiliary and environmental chan- 
nels for each class of triggers is stored. 

Figures 14 and 15 show which auxiliary and environmental channels
were seen to have triggered corresponding to the GW triggers seen in
a certain class (#10). Sub-class 2 for this case was the larger of
the two. Similarly, figures 16 and 17 show which auxiliary and
environmental channels were seen to have triggered corresponding to
the GW triggers seen in another class (#14).

2. Comparison with existing data quality flags 

Data Quality (DQ) flags @, H| identify time periods in GW science
data which are not suitable for astrophysical searches because of the
varying statistical nature of the noise that is caused by
instrumental malfunctioning in the detector and its surroundings. DQ
flags are deemed effective if they can remove high snr glitches from
the GW data streams. A large number of DQ flags exist within the LSC
that are linked to various types of glitches and events observed in
the data stream. We compare the auxiliary and environmental channel
couplings observed in different classes in this study with those that
already exist within the LSC repository.

Figure 18 shows existing DQ flags corresponding to the GW triggers
seen in class #14. These flags indicate that triggers in this class
are related to TCS glitches in the intermediate and end test masses
in the X and Y arms of the interferometer, prestabilised laser power,
pre-lock loss states, a few injections and glitches due to seismic
activity or flying aircraft. Comparing these with the couplings
observed in the same class of triggers in the current study, we can
see that considerably more information can be obtained from the
coupled channels recorded here. One distinction that can be readily
made is that the existing DQ flags, being most often a result of
human observation, are indicative of a qualitative cause rather than
an exhaustive list of all auxiliary and environmental channels that
might have caused the GW trigger. A combination of the existing data
quality flags with information from the LCSS+MHC trigger classes and
their relation to the instrumental and environmental activities can
furnish a more detailed and complete characterization of the triggers
seen in the GW channel.

FIG. 14: The figure shows which auxiliary and environmental channels
were seen to have triggered corresponding to the GW triggers seen in
a certain class (#10). As mentioned earlier, this class was seen to
contain two statistically significant subclasses. This figure shows
the auxiliary and environmental channels that triggered
simultaneously for triggers in subclass 1.

FIG. 15: The figure shows which auxiliary and environmental channels
were seen to have triggered corresponding to the GW triggers seen in
a certain class (#10). As mentioned earlier, this class was seen to
contain two statistically significant subclasses. This figure shows
the auxiliary and environmental channels that triggered
simultaneously for triggers in subclass 2.

FIG. 16: The figure shows which auxiliary and environmental channels
were seen to have triggered corresponding to the GW triggers seen in
a certain class (#14). As mentioned earlier, this class was seen to
contain two statistically significant subclasses. This figure shows
the auxiliary and environmental channels that triggered
simultaneously for triggers in subclass 1.

3. Characterization of the trigger classes 

Once all the information outlined above has been obtained, the
trigger classes can be characterized in terms of (i) the waveform,
(ii) the range of physical properties, (iii) auxiliary and
environmental channel couplings and (iv) DQ flags in use. Let us
illustrate this using classes #10 and #14 as our example. Table I
shows the mean values and range of snr, frequency and Q-values of the
sub-groups found in classes 10 and 14 in our example. It is clear
that the discriminating factor is the frequency. The main LCSS based
class splits into sub-groups because the high frequency triggers
appear as outliers in the four dimensional hierarchical analysis.

Following figures 14, 15, 16 and 17, one can easily read off the
auxiliary and environmental channel activities at the times of the
occurrence of these triggers.

FIG. 17: The figure shows which auxiliary and environmental channels
were seen to have triggered corresponding to the GW triggers seen in
a certain class (#14). As mentioned earlier, this class was seen to
contain two statistically significant subclasses. This figure shows
the auxiliary and environmental channels that triggered
simultaneously for triggers in subclass 2.

FIG. 18: The figure shows existing DQ flags corresponding to the GW
triggers seen in class 14. The flags indicate that triggers in this
class are related to TCS glitches in the intermediate and end test
masses in the X and Y arms of the interferometer, prestabilised laser
power, pre-lock loss states, a few injections and glitches due to
seismic activity or flying aircraft.

Thus, the triggers in a given class and sub-group can be identified
by the shape of the trigger, the central frequency and the coupled
channels and flags. The coupled channels can be ranked (in a
statistical sense) by the percentage use (A) in a given day, as
follows.

$$A = \frac{A_{\mathrm{channel}}}{T_{\mathrm{channel}}} \times 100, \qquad (25)$$

where $A_{\mathrm{channel}}$ refers to the number of triggers seen in
a given auxiliary or environmental channel and $T_{\mathrm{channel}}$
refers to the total number of triggers seen in all auxiliary and
environmental channels. Table II shows the A values for channels
coupled to the triggers in our example classes (#10 and #14). The
percentage use (A) can thus serve as a pattern metric for each
trigger group that comes with a characteristic waveform. In this
particular example, it is quite evident that, apart from the distinct
shapes of the triggers belonging to the two groups, the top auxiliary
and environmental channel percentage usages are different. While both
classes show very high percentage use for the channel OMC-QPD (Output
mode cleaner Quadrant Monitor Photodiode), class #14 seems to have
triggers caused (in a statistical sense) by the PEM (Physical
environment monitor) MX and MY (Mid-station X and Y arms) Seismic
(SEIS) activities that are not seen in the class #10 triggers.
Seismic activities are recorded in the DQ flags Q that are being
reported by other monitors. Another difference between the two
classes is the presence of TCS (Thermal Compensation System) triggers
in class #14, indicating that some of these triggers might have their
origin in the TCS in the intermediate test masses in the X and Y arms
(ITMX and ITMY) of the interferometer. These types of triggers are
not found in class #10.

TABLE I: This table shows the characteristics of trigger classes in
terms of range of physical properties.

TABLE II: This table shows the A values for channels coupled to the
triggers in classes #10 and #14. A full description of the channels
can be found in [50].

Class #10                  Class #14
Channel           A        Channel           A
[...]                      [...]             7.3
PEM-COIL-MAG      6.2      ASC-WFS           6.95
SUS-ETMX/Y        5.4      PEM-BSC           6.82
OMC-PZT           5.4      ASC-QPDX          5.9
ASC-WFS           5.1      PEM-EX/Y          5.24
ASC-QPDX/Y        5.0      PEM-COIL          4.6
PEM-EX/Y          4.6      PEM-LVEA          4.33
PEM-BSC           4.5      PEM-ISCT          3.8
SUS-ITMX/Y        4.2      PEM-MX/Y-SEIS     3.01
PEM-LVEA          3.7      PEM-PSL           3.01
PEM-ISCT          3.2      TCS-ITMX/Y        1.31
PEM-PSL           1.9
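
The percentage use of Eq. (25) and the resulting channel ranking can be sketched as below. The function names are hypothetical and the trigger counts are invented for illustration only; they are not taken from Table II.

```python
def percentage_use(channel_counts):
    """Percentage use of each channel: 100 * (channel count) / (total count)."""
    total = sum(channel_counts.values())
    if total == 0:
        raise ValueError("no auxiliary or environmental triggers recorded")
    return {name: 100.0 * n / total for name, n in channel_counts.items()}


def rank_channels(channel_counts):
    """Channels sorted from highest to lowest percentage use."""
    usage = percentage_use(channel_counts)
    return sorted(usage.items(), key=lambda item: item[1], reverse=True)


# Hypothetical trigger counts for one trigger class (not from Table II).
counts = {"ASC-WFS": 30, "PEM-LVEA": 15, "PEM-PSL": 5}
ranking = rank_channels(counts)
```

Since the values are normalized by the total over all channels, the percentages of one class sum to 100, which makes the rankings of different classes directly comparable.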



VI. CONCLUSION AND FUTURE DIRECTION 

The study explores methods of time domain GW trigger classification
using the shape parameters of trigger waveforms. The study extends to
triggers noted in the GW channels as well as to all auxiliary and
environmental channels. Classification into distinct groups is one of
the most powerful data mining tools for the analysis of large data
sets, as is the case with LIGO science data.

Two algorithms have been tested here, (i) LCSS and (ii) LCSS+FTSE,
with the intent to test the relative computational speed and accuracy
of classification. The different groups of triggers are indicators of
certain common properties (in this case, similar types of waveforms)
and thus can already reduce the dimensionality of the trigger
identification problem by a large factor. This algorithm is then
followed by the MHC analysis. The integration of these two methods in
a single analysis pipeline yields statistically significant classes
of triggers with different waveform signatures and physical
properties. These characteristics in turn are related to the
processes that generate them, and thus classification of waveforms
helps shed light on very important aspects associated with tracking
down trigger sources in the interferometer and its environment.

The current study was performed on simulated triggers in the absence
of noise and also in the presence of various levels of noise to set
benchmarks. The two algorithms differed in computational speed, but
no appreciable difference in their accuracy in classifying the
triggers was noticed. LCSS and LCSS+FTSE showed comparable
computational speed for small samples (sample size < 150). The
combined FTSE+LCSS became rapidly more expensive with increasing
sample size. The computation of LCSS from the intersection list of
FTSE shown in figure 2 is carried out in three nested loops [27]. The
first loop is used to go to individual cells of the intersection
list. The second is used to check individual values of a cell and the
third one ensures the order of the subsequence that contributes to
LCSS. The second and third loops could be avoided if their purposes
could be taken care of while building the intersection list. The
space complexity of LCSS+FTSE is very high compared to LCSS alone,
and most of the grid cells of the former are usually unoccupied. The
amount of space could be reduced by increasing the threshold value
[27] (reducing the fineness of the grid) which, unfortunately, would
compromise the accuracy of LCSS. In future applications of the
combined FTSE and LCSS, the algorithms need to be efficiently
parallelized for treating large sample sizes. In the case of triggers
without noise, as expected, both pipelines yielded an accurate
classification structure, with each type of trigger being classified
into the right cluster.
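
For reference, the plain LCSS comparison (without the FTSE intersection list discussed above) is commonly computed with a dynamic-programming table. The sketch below uses the widely used threshold-matching recurrence for real-valued series; the parameter names `epsilon` (amplitude tolerance) and `delta` (optional limit on the index offset) are assumptions for illustration and are not claimed to be the settings or the exact variant used in this study.

```python
def lcss_length(a, b, epsilon, delta=None):
    """Length of the longest common subsequence of two real-valued series.

    Samples a[i] and b[j] are taken to match when |a[i] - b[j]| <= epsilon
    and, if delta is given, |i - j| <= delta.
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            in_window = delta is None or abs(i - j) <= delta
            if in_window and abs(a[i - 1] - b[j - 1]) <= epsilon:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def lcss_similarity(a, b, epsilon, delta=None):
    """LCSS length normalised by the shorter series length (0 to 1)."""
    shortest = min(len(a), len(b))
    if shortest == 0:
        return 0.0
    return lcss_length(a, b, epsilon, delta) / shortest


# Two short hypothetical series that agree in three of four samples.
score = lcss_similarity([0.0, 1.0, 2.0, 3.0], [0.05, 5.0, 2.02, 3.01], 0.1)
```

The table has (n+1)(m+1) entries, so time and memory grow with the product of the two series lengths; this quadratic cost per pair is the baseline against which any speed-up scheme has to be judged.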

In the case of triggers embedded in noise, the classification
structure started to depart from the true number of classes present
in the data for snr < 16. This happens because, with increasing
noise, many of the characteristic trigger waveform features get
masked by the mixed noise. This is illustrated in figure 9 and
explained in detail in section V. This is the reason that a careful
trigger extraction pipeline has been employed in the classification
of the real LIGO S6 triggers. LIGO data are noise dominated, and
sources of noise from auxiliary and environmental channels are many.
A carefully constructed conditioning algorithm, as described in
section IV C, ensures that the triggers are extracted with as much
accuracy as possible, minimizing the noise content. This enhances the
classification accuracy.

An important observation here is that classes produced based on
similar waveforms can be heterogeneous in terms of their physical
properties, e.g. amplitude, Q-value, central frequency and snr of the
triggers. Application of the MHC to look for further sub-classes in
each of the waveform based classes is a very useful and productive
step. This is the first time such an analysis has been performed in
the context of GW. As has been found in the study, the waveform based
classes of triggers can be related to a group of auxiliary and
environmental channels that are seen to have the same type of
triggers, and the sub-classes can then further split the individual
classes to indicate which channels are most likely to be responsible
for the production of these triggers. As stated in section V, since
each of the subgroups contains triggers with very characteristic
properties and can be related to a specific set of channels, the
method proves useful in the classification of triggers seen in
gravitational wave data and in helping to track down the sources or
origins of the triggers. This can be a very effective means of
characterizing the triggers and cataloging their properties and
sources.



The final outcome of the analysis shows the existence of 19
statistically significant classes of triggers (in the data segments
analyzed) with distinct waveforms and many more sub-classes with
characteristic physical properties coming from the GW channel, which
can be coupled to different sets of auxiliary and environmental
systems. Application of LCSS (which yields classes based on
waveforms) alone would give 19 distinct classes, while application of
only the MHC analysis would have given at most four statistically
significant classes.

Over a sufficient length of time, a large enough knowledge base of
trigger information can be compiled such that triggers can be
identified without delay, in response to the near real-time (low
latency) needs of Advanced LIGO's (33M351 loll] detector
characterization.
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